Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec 01 18:33:35 2021"

About me

I have used R and RStudio quite a lot in my work. I’m familiar with Git and GitHub, but have not become ‘fluent’ with them yet – in fact, using them has always felt a bit cumbersome. I hope this course will help me get better at using Git/GitHub.

I heard about the course from my supervisor and department mailing list.

Here is a link to my GitHub page for this project: https://github.com/jpkos/IODS-project

List test:

  • First item
  • Second item

Chapter 2: Linear regression practice

In this exercise, we use linear regression to evaluate how different learning approaches affect students’ performance in a statistics course.

We will start off by loading the libraries

library(dplyr)
library(GGally)
library(ggplot2)

Read and describe the data

In the data wrangling section, we created a csv file with the data. Let’s load that.

data <- read.csv("data/learning2014_csv.csv")

Pick only the combined columns and the columns with participant info

cols <- c("Age", "Attitude", "Points", "gender", "deep", "surf", "stra")
data <- data[cols]

Data overview (dimensions [rows and cols] and structure [variables and their datatypes])

dim(data)
## [1] 166   7
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

The dataset is about how students’ learning approaches affect their exam points in a statistics course. The data has combined variables that describe the students’ learning approaches (deep, surface, strategic) and individual variables with participant information (age, gender, attitude towards statistics, exam points).

Summary of the variables:

summary(data)
##       Age           Attitude         Points         gender         
##  Min.   :17.00   Min.   :14.00   Min.   : 7.00   Length:166        
##  1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   Class :character  
##  Median :22.00   Median :32.00   Median :23.00   Mode  :character  
##  Mean   :25.51   Mean   :31.43   Mean   :22.72                     
##  3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75                     
##  Max.   :55.00   Max.   :50.00   Max.   :33.00                     
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000

Plot variable pairs to visualize correlations

p <- ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

Some notes about the variables:

  • No major differences between genders
  • Among the learning approaches, the strongest correlations with exam points were for the strategic and surface approaches (with opposite signs)
  • The surface approach correlated negatively with attitude, whereas the deep approach correlated positively
  • Age correlated in opposite directions with the surface and strategic approaches
  • However, the age distribution shows that most participants were around the same age (early to mid 20s)

Linear regression models

Let’s fit a model with three explanatory variables (strategic approach, surface approach, attitude) and the exam points as the target variable

model <- lm(Points ~ stra + surf + Attitude, data=data)
summary(model)
## 
## Call:
## lm(formula = Points ~ stra + surf + Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Attitude has a significant effect; the others do not. Let’s fit another model, this time with Attitude only

model2 <- lm(Points ~ Attitude, data=data)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The result indicates that students’ attitude correlates strongly with their exam points: a 1-point increase in the Attitude score is associated with an increase of about 0.35 exam points (t = 6.214, p < 0.001).

The multiple R-squared measures how well the model fits the data: it is the proportion of variance in the target variable that the model explains (the adjusted R-squared additionally penalizes for the number of predictors). The multiple R-squared of the first model was 0.2074, i.e. the model explains roughly 20.7% of the variance in the data. When we refit with only Attitude as the predictor, the R-squared is 0.1906 (~19.1%), meaning that Attitude alone accounts for nearly all of the variance explained by the larger model.
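As a sanity check, the multiple R-squared is simply the squared correlation between the observed and fitted values. A minimal sketch on simulated data (not the course dataset; `x`, `y` and `fit` are illustrative names):

```r
# Toy data: the R-squared reported by summary() equals cor(observed, fitted)^2
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

r2_reported <- summary(fit)$r.squared
r2_manual   <- cor(y, fitted(fit))^2
all.equal(r2_reported, r2_manual)  # TRUE
```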

Diagnostic plots:

plot(model2, c(1,2,5))

In the residuals vs fitted plot, the residuals appear to get slightly smaller at larger fitted values, although there are also fewer data points there. Overall there does not seem to be serious heteroscedasticity (the residuals should not depend on the fitted values, i.e. they should be randomly scattered).

The Q-Q plot shows some deviation from normality at the extreme values; however, the deviations are small and probably do not violate the normality assumption (the errors should be normally distributed).

The residuals vs leverage plot shows a few mild outliers, but they most likely do not seriously affect the model results.
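These diagnostics also have numeric counterparts that can be checked directly; a minimal sketch on simulated data (`fit` below is a toy model, not one of the models above):

```r
# Toy model for illustrating numeric regression diagnostics
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
fit <- lm(y ~ x)

lev  <- hatvalues(fit)       # leverage of each observation
cook <- cooks.distance(fit)  # influence of each observation

# flag observations with unusually high leverage (> 2x the average leverage)
which(lev > 2 * mean(lev))
# a common rule of thumb flags Cook's distance above 1
which(cook > 1)
```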

Finally, let’s plot points vs attitude.

plot(data$Attitude, data$Points)


Chapter 3: Logistic regression practice

Let’s start by loading the necessary libraries

library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)

Dataset description

Load the data we processed in the data wrangling section (actually, I’m using the ready-made dataset)

data <- read.csv("data/student/alc.csv")

Print a summary of the data structure

str(data)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "yes" "no" "no" "no" ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : chr  "yes" "no" "yes" "yes" ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

The data describes each subject’s school performance together with various study-related and personal details such as absences (from school), failures (at school), age, family size, parents’ education level, etc. G1, G2 and G3 are the grades in three different periods for the given course subject (Portuguese or math). Alcohol consumption is encoded as the binary variable high_use (high/low use).

Analysis

We will analyse the students’ alcohol consumption and its relationship with four variables. The dependent variable is high alcohol use (high_use); the independent variables and the corresponding hypotheses are:

  1. Absences: Students that have more absences are more likely to have high alcohol use
  2. Failures: Students with more failures are more likely to have high alcohol use
  3. Sex: Men are more likely to have higher alcohol use
  4. Romantic: People who are in a relationship are less likely to have high alcohol use

First, let’s plot the distributions of these variables:

cols <- c("sex", "failures", "absences", "romantic", "high_use")
gather(data[cols]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar()

Fit a model with these variables:

model <- glm(high_use ~ absences + failures + sex + romantic, data=data, family="binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences + failures + sex + romantic, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1132  -0.8415  -0.5849   1.0329   2.1106  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.86714    0.24303  -7.683 1.56e-14 ***
## absences     0.09406    0.02323   4.049 5.15e-05 ***
## failures     0.60568    0.20812   2.910  0.00361 ** 
## sexM         0.98393    0.24789   3.969 7.21e-05 ***
## romanticyes -0.24606    0.26635  -0.924  0.35558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 406.13  on 365  degrees of freedom
## AIC: 416.13
## 
## Number of Fisher Scoring iterations: 4

According to the results:

Hypotheses 1–3 were supported by the data. For hypothesis 4, the sign of the effect was negative as predicted, but the effect is not statistically significant.

The confidence intervals of these variables are:

confint(model)
## Waiting for profiling to be done...
##                   2.5 %     97.5 %
## (Intercept) -2.36326577 -1.4080742
## absences     0.05079132  0.1421351
## failures     0.20530644  1.0277095
## sexM         0.50384493  1.4776693
## romanticyes -0.77779991  0.2691722

We can also interpret the coefficients as odds ratios by exponentiating them

coef(model) %>% exp
## (Intercept)    absences    failures        sexM romanticyes 
##   0.1545656   1.0986256   1.8324980   2.6749408   0.7818748

Now the coefficients can be read as odds ratios: e.g. being male multiplies the odds of high alcohol use by about 2.67, and each additional absence multiplies them by about 1.10.
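To make the odds-ratio reading concrete, here is a small sketch; only the coefficient value is taken from the model above, the baseline odds are hypothetical:

```r
beta <- 0.98393          # log-odds coefficient for sexM from the model above
exp(beta)                # odds ratio, about 2.67

# hypothetical baseline: if the odds of high use were 0.25 for a female student,
# the same student profile with sex = M would have odds
odds_f <- 0.25
odds_m <- odds_f * exp(beta)
odds_m / (1 + odds_m)    # back to a probability
```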

Let’s compare predicted vs real results with a 2x2 table

predictions <- as.factor(predict(model, newdata = data, type = "response")>0.5)
true <- as.factor(data$high_use)
confusionMatrix(data=predictions, reference=true)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   247   77
##      TRUE     12   34
##                                           
##                Accuracy : 0.7595          
##                  95% CI : (0.7126, 0.8021)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.006538        
##                                           
##                   Kappa : 0.3122          
##                                           
##  Mcnemar's Test P-Value : 1.169e-11       
##                                           
##             Sensitivity : 0.9537          
##             Specificity : 0.3063          
##          Pos Pred Value : 0.7623          
##          Neg Pred Value : 0.7391          
##              Prevalence : 0.7000          
##          Detection Rate : 0.6676          
##    Detection Prevalence : 0.8757          
##       Balanced Accuracy : 0.6300          
##                                           
##        'Positive' Class : FALSE           
## 

In total 77 + 12 = 89 predictions were incorrect. The total number of observations was 370, so 89/370 = 0.2405, i.e. about 24% of the data was predicted with the wrong label.

To see whether this is a good result, consider what could be achieved by chance. If we just flipped coins, we would get about 50% correct, and the logistic regression model is clearly better than that. However, the classes are imbalanced: there are 259 non-high users versus 111 high users. Thus, if we had a model that predicted everyone as a non-high user, we would get 70% correct. The model is therefore actually not that much better than the trivial “no one is a high user” model.
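The baseline comparison above can be written out explicitly with the counts from the confusion matrix:

```r
# class counts from the data
n_low  <- 259                  # non-high users
n_high <- 111                  # high users
n      <- n_low + n_high       # 370 observations in total

baseline_acc <- n_low / n      # always predict "not high": 0.70
model_acc    <- (247 + 34) / n # correct predictions from the confusion matrix

model_acc - baseline_acc       # improvement over the majority-class baseline, ~0.06
```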

Chapter 4: Clustering and classification practice

As usual, let’s start by loading the libraries. Since I’m not sure which ones I’ll end up using, I’ll load all of the ones we have used so far:

library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
library(MASS)

Dataset

Load the Boston dataset from the MASS package

data("Boston")

Explore the data

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset describes housing values in the Boston area and includes several variables that could potentially affect them. These include: crime rate (crim), average number of rooms per dwelling (rm), nitrogen oxide concentration (nox), property tax rate (tax), the proportion of black residents (black), and so on.

Let us next create graphical plots of the dataset to visualize correlations:

ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

The strongest correlations with median housing value are:

  • lstat (lower status of the population): -0.738. More lower class (less educated, poor employment) population will decrease housing prices.
  • rm (average number of rooms): 0.695. More rooms will increase price.
  • ptratio (pupil-teacher ratio): -0.508. Bigger P-T ratios (pupils per teacher) decrease prices.
  • indus (proportion of non-retail business acres): -0.484. More land reserved for businesses will decrease housing value.

Each of the four strongest correlations makes sense intuitively.
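The correlation values quoted above can be pulled straight out of the correlation matrix; a small sketch (assumes only the MASS package loaded earlier):

```r
library(MASS)  # provides the Boston dataset

# correlations of all variables with the median housing value (medv)
cors <- cor(Boston)["medv", ]
# drop the self-correlation and sort from most negative to most positive
sort(cors[names(cors) != "medv"])
```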

Show summary of the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Some notes about this summary:

  • age: the percentage of houses built before 1940. The median is 77.5%, so most of the houses are pre-1940 (the data was collected in the 1970s, so they were not that old at the time)

Analysis

LDA

Scale the dataset

Boston <- data.frame(scale(Boston))
summary(Boston)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Now each variable has mean 0 and standard deviation 1. Next, we will cut the crime rate variable into four categories using its quartiles as break points

quants <- quantile(Boston$crim)
crimerate <- cut(Boston$crim, breaks=quants, include.lowest=TRUE)

Replace the continuous crime rate in the dataset with the categorical version

Boston2 <- Boston[,!(names(Boston) %in% c("crim"))]
Boston2$crim_cat <- crimerate

Divide into train/test sets with an 80/20 split (note that without set.seed() the split, and hence the results below, will vary between runs)

train_ind <- sample(nrow(Boston2), nrow(Boston2)*0.8)
Boston2.train <- Boston2[train_ind,]
Boston2.test <- Boston2[-train_ind,]

Fit a Linear Discriminant Analysis (LDA) model

lda.model <- lda(crim_cat ~., data=Boston2.train)

Plot a biplot with arrows. Let’s use the example function provided on DataCamp:

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

Plot with the fitted model

cats <- as.numeric(Boston2.train$crim_cat)
plot(lda.model, dimen=2, col=cats, pch=cats)
lda.arrows(lda.model)

Save the categorical crime variable from the test set and then remove it

cats.test <- Boston2.test$crim_cat
Boston2.test <- dplyr::select(Boston2.test, -crim_cat)

Now predict the categories for the test data and cross-tabulate

cats.pred <- predict(lda.model, newdata = Boston2.test)
table(correct = cats.test, predicted = cats.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              15             10               0              0
##   (-0.411,-0.39]                3             14               5              0
##   (-0.39,0.00739]               0              8              20              1
##   (0.00739,9.92]                0              0               1             25

The best results were achieved for the highest crime-rate class (0.00739–9.92), where only 1 of 26 observations was misclassified. The lowest class had the highest error rate (10 of 25 wrong, 40%), and the second lowest class was close behind (8 of 22 wrong, ~36%). Notably, almost all misclassifications land in a neighbouring category, which makes sense because the classes are ordered.
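Per-class error rates can be computed directly from the cross-tabulation; a sketch using the counts printed above (rows = correct class, columns = predicted class):

```r
# confusion counts from the table above, lowest crime class first
tab <- matrix(c(15, 10,  0,  0,
                 3, 14,  5,  0,
                 0,  8, 20,  1,
                 0,  0,  1, 25),
              nrow = 4, byrow = TRUE)

correct_per_class <- diag(tab)
total_per_class   <- rowSums(tab)
error_rate        <- 1 - correct_per_class / total_per_class
round(error_rate, 3)  # 0.400 0.364 0.310 0.038
```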

K-means

Next we will explore the data using clustering methods, namely the k-means clustering approach. First, fit the model using k=4:

km.model <- kmeans(Boston, centers=4)

Plot results

pairs(Boston, col=km.model$cluster)

We can choose the number of clusters by computing the total within-cluster sum of squares (TWCSS) for a range of k values and looking for the point where the curve levels off (the “elbow”). Using the code from DataCamp:

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

There is a sharp drop at k=2, and after that the decreases are more gradual. Let’s refit k-means with k=2.

km.model2 <- kmeans(Boston, centers=2)

Plot

pairs(Boston, col=km.model2$cluster)

It seems that 2 is indeed a good number of clusters; visual inspection shows no clear outliers (i.e. no possible third cluster sticking out)

Chapter 5: Dimension reduction practice

As usual, let’s start with loading the libraries:

library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
library(MASS)
library(reshape2)

Dataset description

Load the data we processed in the data wrangling section (actually, I’m using the ready-made dataset)

data <- read.csv("data/human/human2.txt")

Print a summary of the data structure

str(data)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The dataset includes human development metrics, such as life expectancy, GNI and education level, for different countries. Definitions for some of the metrics:

  • Maternal mortality (Mat.Mor): maternal deaths per 100,000 live births
  • Adolescent birth rate (Ado.Birth): number of births to women of 15-19 years of age, per 1000 women in the same age group

Plot variable pairs to visualize correlations

p <- ggpairs(data, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

The strongest correlations can be seen in

  1. Life expectancy and maternal mortality (-0.857): countries with high maternal mortality have lower life expectancy
  2. Adolescent birth rate and maternal mortality (0.759): countries with many 15-19-year-old mothers have higher maternal mortality
  3. Maternal mortality and education level (-0.736): countries with a lower education level have higher maternal mortality
  4. Education level and life expectancy (0.624): countries with a more highly educated population have a longer life expectancy

Scale the data

data.scaled <- data.frame(scale(data))

Analysis

Now we shall run the principal component analysis (PCA) on the data. First with the non-scaled data.

pca.original <- prcomp(data)

Draw a biplot

biplot(pca.original, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Then again, this time for the scaled data

pca.scaled <- prcomp(data.scaled)

and plot

biplot(pca.scaled, choices = 1:2)

The results look very different. This is because the variables were measured in different, somewhat arbitrary units (you can measure maternal mortality per 100k births or per 1M births), so their magnitudes cannot be compared directly (apples and oranges), and the variable with the largest variance (here GNI) dominates the first principal component. When we standardize the data, every variable has mean 0 and standard deviation 1, which puts them on a comparable footing.
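A quick demonstration of what scale() does, on a toy vector rather than the human data:

```r
x <- c(2, 4, 6, 8)
z <- as.numeric(scale(x))  # (x - mean(x)) / sd(x)

mean(z)  # 0
sd(z)    # 1
```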

When we look at the principal components in the last figure, we can see some patterns:

  • Moving to the left along the first principal component and up along the second, we see the Nordic countries, which are highly developed in many areas.

  • Moving down along PC2, we see countries that are wealthy but may have issues with gender equality (low female representation in parliament and in the labor force).

  • Moving to the right, we see countries that are less wealthy and struggle with maternal mortality and adolescent births.

  • The lower right-hand corner has countries that are poor, have gender inequality, and face the above-mentioned problems connected to childbirth.

Tea dataset

Load the tea dataset:

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
data(tea)

Summary of the data:

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

Structure:

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Dimensions:

dim(tea)
## [1] 300  36

The dataset has 300 rows and 36 columns. The columns (variables) describe people’s tea-drinking habits. Most of the variables are categorical, and most of them have two levels, such as sugar use (sugar/No.sugar).
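Since almost all variables are factors, we can quickly tabulate how many levels each one has. A sketch using a small stand-in data frame (on the real data the call would simply be `sapply(tea, nlevels)`):

```r
# Hypothetical two-column stand-in for the tea data
df <- data.frame(
  sugar = factor(c("sugar", "No.sugar", "sugar")),
  Tea   = factor(c("black", "Earl Grey", "green"))
)

# Count the number of factor levels per column
n_levels <- sapply(df, nlevels)
n_levels
```

This kind of check confirms that the data is suitable for multiple correspondence analysis, which expects categorical variables.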

Do multiple correspondence analysis and visualize

MCA(X=tea[, -which(names(tea) %in% c('age'))], method="indicator")

## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 35 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$ind"            "results for the individuals"     
## 8  "$ind$coord"      "coord. for the individuals"      
## 9  "$ind$cos2"       "cos2 for the individuals"        
## 10 "$ind$contrib"    "contributions of the individuals"
## 11 "$call"           "intermediate results"            
## 12 "$call$marge.col" "weights of columns"              
## 13 "$call$marge.li"  "weights of rows"

In the plot, the variable levels are drawn as points on the first two MCA dimensions. The distance between two categories indicates how similar they are: the closer the points, the more often the categories co-occur. Two of the lowest points in the graph are the age group 15-24 and student, meaning that these two levels go together (young people tend to be students). We can also see some less obvious patterns:

  • Black, green and Earl Grey teas lie in completely different parts of the plot. This indicates that subjects’ characteristics (how they drink tea and how they perceive it) are associated with the type of tea they drink. For example, older people prefer black tea and younger people prefer Earl Grey; non-workers are closer to black tea while workers are closer to green tea.

  • Older people prefer unpackaged tea, younger people use packaged tea more often.

  • Workmen prefer cheap tea, older people prefer upscale tea.

  • Young people and/or students are clearly different from other groups.

  • People who drink tea 1 to 2 times per week are more likely to use sugar than those who drink it 3 to 6 times per week. Older people tend not to use sugar, while workers are more likely to use it.
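With method = "indicator", MCA operates on the one-hot (indicator) matrix of the factors: each category becomes its own 0/1 column, and each row sums to the number of variables. A base-R sketch of that coding on two toy variables (not the actual FactoMineR internals, just the idea):

```r
# Hypothetical stand-in with two binary factors
df <- data.frame(
  sugar = factor(c("sugar", "No.sugar", "sugar")),
  sex   = factor(c("F", "M", "F"))
)

# Build the full indicator matrix: one column per category, no intercept,
# and full (identity) dummy coding for every factor
Z <- model.matrix(~ sugar + sex - 1, data = df,
                  contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
Z
```

MCA then amounts to a correspondence analysis of this matrix, which is why categories that often co-occur in the same rows end up close together in the plot.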

Chapter 6: Longitudinal data analysis practice